Representation Independent Proximity and Similarity Search
Abstract
Finding similar or strongly related entities in a graph database is a fundamental problem in data management and analytics, with applications in similarity query processing, entity resolution, and pattern matching. Similarity search algorithms usually leverage the structural properties of the data graph to quantify the degree of similarity or relevance between entities. Nevertheless, the same information can be represented in many different structures, and the structural properties observed over a particular representation do not necessarily hold for alternative structures. Thus, these algorithms are effective on some representations and ineffective on others. We postulate that a similarity search algorithm should return essentially the same answers over different databases that represent the same information. We formally define the property of representation independence for similarity search algorithms as their robustness against transformations that modify the structure of databases but preserve their information content. We formalize two widespread groups of such transformations, called relationship reorganizing and entity rearranging transformations. We show that current similarity search algorithms are not representation independent under these transformations and propose an algorithm called R-PathSim, which is provably robust under relationship reorganizing transformations and a subset of entity rearranging transformations. We perform an extensive empirical study on the representation independence of current similarity search algorithms using real-world databases under relationship reorganizing and entity rearranging transformations. Our empirical results suggest that current similarity search algorithms, except for R-PathSim, are highly sensitive to the data representation. These results also indicate that R-PathSim is as effective as or more effective than other similarity search algorithms.
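The sensitivity described in the abstract can be illustrated with a toy example. The sketch below is not R-PathSim and does not implement the paper's formal transformations; it uses a plain common-neighbor count as a stand-in similarity measure, and the papers, venues, areas, and both graph layouts are hypothetical. The two layouts are assumed to carry the same facts (each paper has one venue, each venue has one area), yet the scores for the query paper `p1` differ.

```python
# Toy illustration (hypothetical data): the same facts stored in two graph
# layouts give different common-neighbor similarity scores.

def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, x, y):
    """Stand-in similarity: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])

# Assumed facts: each paper appears in one venue, each venue has one area.
venue_of = {"p1": "v1", "p2": "v2", "p3": "v1", "p4": "v3"}
area_of = {"v1": "DB", "v2": "DB", "v3": "ML"}

paper_venue = list(venue_of.items())

# Representation A: the area is attached to the venue.
graph_a = build_graph(paper_venue + list(area_of.items()))

# Representation B: the area is attached directly to each paper.
graph_b = build_graph(
    paper_venue + [(p, area_of[v]) for p, v in venue_of.items()]
)

candidates = ["p2", "p3", "p4"]
scores_a = {q: common_neighbors(graph_a, "p1", q) for q in candidates}
scores_b = {q: common_neighbors(graph_b, "p1", q) for q in candidates}

print("A:", scores_a)
print("B:", scores_b)
```

In layout A, `p2` (same area, different venue) ties with the unrelated `p4`; in layout B, `p2` ranks above `p4`. A representation-independent algorithm, in the sense of the abstract, would be expected to return essentially the same answers over both layouts.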
Similar resources
Heterogeneous Information Network Embedding for Meta Path based Proximity
A network embedding is a representation of a large graph in a low-dimensional space, where vertices are modeled as vectors. The objective of a good embedding is to preserve the proximity (i.e., similarity) between vertices in the original graph. This way, typical search and mining methods (e.g., similarity search, kNN retrieval, classification, clustering) can be applied in the embedded space wi...
A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation
Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...
Effects of search efficiency on surround suppression during visual selection in frontal eye field.
Previous research has shown that visually responsive neurons in the frontal eye field of macaque monkeys select the target for a saccade during efficient, pop-out visual search through suppression of the representation of the nontarget distractors. For a fraction of these neurons, the magnitude of this distractor suppression varied with the proximity of the target to the receptive field, exhibi...
Tractable Algorithms for Proximity Search on Large Graphs
Identifying the nearest neighbors of a node in a graph is a key ingredient in a diverse set of ranking problems, e.g. friend suggestion in social networks, keyword search in databases, web-spam detection etc. For finding these “near” neighbors, we need graph theoretic measures of similarity or proximity. Most popular graph-based similarity measures, e.g. length of shortest path, the number of c...
Approximate Similarity Search in Metric Data by Using Region Proximity
The problem of approximated similarity search for the range and nearest neighbor queries is investigated for generic metric spaces. The search speedup is achieved by ignoring data regions with a small, user-defined, proximity with respect to the query. For zero proximity, exact similarity search is performed. The problem of proximity of metric regions is explained and a probabilistic approach is...
Journal:
- CoRR
Volume: abs/1508.03763
Pages: -
Publication date: 2015